Introduction

Increasing sales and profit margins is one of the most crucial priorities for business owners. Thus, the owner of a local cafe asked me to analyze three different datasets to help increase profits and to predict future profits.

Research questions

How could a cafe increase its profits?

To answer this question, discover new information, or confirm ideas the owner already holds, I plan to:

  • Discover which days have the most customers, using a time series on the visitors data.
  • Discover the best-selling items, using the products data.
  • Predict the profits for the next year, using a time series.

The source and structure of the data

The visit data comes from a device the owner of the cafe installed above the main door to count visitors and record some extra variables. It has 7 variables and 8089 observations.

Variable Description
Date The date
Day The day of the week
Time The hour of the day
ValueIn The number of visitors entering the cafe
ValueOut The number of visitors leaving the cafe
TurnInRate (%) The percentage of passers-by who turn in
OutsideTraffic The number of people passing outside the cafe

The items data comes from the sales website; it has 5 variables and 96 observations.

Variable Description
item The name of the item
count The number of pieces sold
price The total selling price
cost The cost of the items
profits The amount of money earned

The sales data comes from the sales website that the owner uses directly; it has 6 variables and 338 observations.

Variable Description
Date The date
Total sales The total number of sales
Items cost The cost of the sold items
Taxes Additional fees
Offers Discounts and other offers
Profits The amount of money earned

Assumptions

  1. In the visit data, the variables are positively skewed due to the large number of zeros and ones, since the data is hourly.

  2. There may not be enough data, since the cafe has been open for less than a year.

Discovering which days have the most customers using the visitors data

## load packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(kknn)
library(ggrepel)
library(corrplot)
## corrplot 0.90 loaded
library(dplyr)       # for data manipulation
library(EnvStats)
## 
## Attaching package: 'EnvStats'
## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm
## The following object is masked from 'package:base':
## 
##     print.default
library(RColorBrewer)
library(vip)         # for variable importance
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
#Time Series
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(forecast)
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✓ lubridate   1.7.10     ✓ feasts      0.2.2 
## ✓ tsibble     1.1.0      ✓ fable       0.3.1 
## ✓ tsibbledata 0.3.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date()      masks base::date()
## x dplyr::filter()        masks stats::filter()
## x xts::first()           masks dplyr::first()
## x fabletools::forecast() masks forecast::forecast()
## x tsibble::index()       masks zoo::index()
## x tsibble::intersect()   masks base::intersect()
## x tsibble::interval()    masks lubridate::interval()
## x dplyr::lag()           masks stats::lag()
## x xts::last()            masks dplyr::last()
## x tsibble::setdiff()     masks base::setdiff()
## x tsibble::union()       masks base::union()
library(tsibble)
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
## 
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
## 
##     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
##     flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
##     splice
# Ploting
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(ggpubr)  #To arrange plots on one page
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:forecast':
## 
##     gghistogram
# Modeling process
library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom        0.7.9      ✓ rsample      0.1.0 
## ✓ dials        0.0.10     ✓ tune         0.1.6 
## ✓ infer        1.0.0      ✓ workflows    0.2.3 
## ✓ modeldata    0.1.1      ✓ workflowsets 0.1.0 
## ✓ parsnip      0.1.7      ✓ yardstick    0.0.8 
## ✓ recipes      0.1.17
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x rlang::%@%()          masks purrr::%@%()
## x yardstick::accuracy() masks fabletools::accuracy(), forecast::accuracy()
## x rlang::as_function()  masks purrr::as_function()
## x scales::discard()     masks purrr::discard()
## x plotly::filter()      masks dplyr::filter(), stats::filter()
## x xts::first()          masks dplyr::first()
## x recipes::fixed()      masks stringr::fixed()
## x rlang::flatten()      masks purrr::flatten()
## x rlang::flatten_chr()  masks purrr::flatten_chr()
## x rlang::flatten_dbl()  masks purrr::flatten_dbl()
## x rlang::flatten_int()  masks purrr::flatten_int()
## x rlang::flatten_lgl()  masks purrr::flatten_lgl()
## x rlang::flatten_raw()  masks purrr::flatten_raw()
## x infer::generate()     masks fabletools::generate()
## x rlang::invoke()       masks purrr::invoke()
## x dplyr::lag()          masks stats::lag()
## x xts::last()           masks dplyr::last()
## x rlang::list_along()   masks purrr::list_along()
## x rlang::modify()       masks purrr::modify()
## x parsnip::null_model() masks fabletools::null_model()
## x rsample::populate()   masks Rcpp::populate()
## x rlang::prepend()      masks purrr::prepend()
## x yardstick::spec()     masks readr::spec()
## x rlang::splice()       masks purrr::splice()
## x recipes::step()       masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/

Sales Data

Loading and Exploring Data

# Loading the data
sales.data <- read.csv("~/Documents/GitHub/DS_Capstone_Cafe_Analysis/sales-data.csv")

Exploring some of the most important variables

# Plotting the response
Prof.plot <- ggplot(data=sales.data, aes(x=Profits)) +
        geom_histogram(fill="black",bins = 30) +
        xlab("The amount of money earned") +
        NULL

## The rest of the variables
cost.plot <- ggplot(data=sales.data, aes(x=Items.cost)) +
        geom_histogram(fill="black",bins = 30) +
        xlab("The cost of the sold items") +
        NULL


tax.plot <- ggplot(data=sales.data, aes(x=Taxes)) +
        geom_histogram(fill="black",bins = 30) +
        xlab("Additional fee") +
        NULL


offer.plot <- ggplot(data=sales.data, aes(x=Offers)) +
        geom_histogram(fill="black",bins = 30) +
        xlab("Discount or some offers") +
        NULL


tot.sale.plot <- ggplot(data=sales.data, aes(x= Total.sales)) +
        geom_histogram(fill="black",bins = 30) +
        xlab("The total number of sales") +
        NULL

ggarrange(Prof.plot, cost.plot, tax.plot, offer.plot, tot.sale.plot + rremove("x.text"),
          ncol = 3, nrow = 2)

  • All the variables are positively skewed, with a large number of zeros due to the closing hours.

Time Series Analysis

ARIMA

  • It is a time series forecasting method. The acronym ARIMA stands for AutoRegressive Integrated Moving Average: the AR part regresses the series on its own lags, the I part is the differencing applied to make the series stationary, and the MA part regresses on lagged forecast errors.

Stationarity

  • A stationary time series is one whose statistical properties are independent of the time at which the series is recorded.
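In practice, stationarity can be checked with a unit-root test such as the augmented Dickey-Fuller test from the tseries package loaded above. A minimal sketch on simulated series (the variable names here are illustrative, not part of the cafe data):

```r
library(tseries)

set.seed(1)
rw <- ts(cumsum(rnorm(300)))  # random walk: non-stationary
wn <- ts(rnorm(300))          # white noise: stationary

adf.test(rw)  # large p-value: cannot reject the unit root (non-stationary)
adf.test(wn)  # small p-value: evidence the series is stationary
```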

Seasonality

  • It is a repeating pattern within a year. We cannot see it in our data because we do not have a full year of records.
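Since the cafe series is too short to show a seasonal pattern, here is a hedged illustration on R's built-in AirPassengers data, using ggseasonplot() from the forecast package loaded above:

```r
library(forecast)
library(ggplot2)

# AirPassengers is monthly data spanning many years, so the within-year
# pattern (a peak every summer) is clearly visible in a seasonal plot
ggseasonplot(AirPassengers) +
  ggtitle("Seasonal Plot: AirPassengers (illustration only)")
```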

Auto Regressive(AR)

  • It is similar to basic linear regression, except that each observation is regressed on the prior observation.

Moving Averages(MA)

  • It is a simple time series model designed to account for autocorrelation. It has a regression-like structure, but each observation is regressed on the prior innovation (the forecast error), which is not directly observed.
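The two model families can be sketched by simulating series with known coefficients and checking that arima() recovers them; the coefficient value of 0.6 is an arbitrary choice for illustration:

```r
set.seed(42)

# Simulate an AR(1) and an MA(1) process, each with true coefficient 0.6
ar.sim <- arima.sim(model = list(ar = 0.6), n = 500)
ma.sim <- arima.sim(model = list(ma = 0.6), n = 500)

# The fitted coefficients should land near the true value of 0.6
coef(arima(ar.sim, order = c(1, 0, 0)))["ar1"]
coef(arima(ma.sim, order = c(0, 0, 1)))["ma1"]
```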
###########################################
# This file loads monthly sales per day
# We will use data to forecast 
###########################################

# Taking out the observations from 2020-11-26 until 2020-11-30, since there won't be a full monthly effect for this month
sales.data <- slice(sales.data,-c(303:333))

# Choosing only the Date and Profits columns
ts.sales <- select(sales.data,c(1,6))

# Add some features to our dataset
ts.sales$Date = ymd(ts.sales$Date)
ts.sales$year = year(ts.sales$Date)
ts.sales$month = month(ts.sales$Date)
ts.sales$day = day(ts.sales$Date)



#Declare this as time series data 

# Daily data

## Create a daily Date sequence - helps when working with dates
inds <- seq(as.Date("2021-01-01"), as.Date("2021-10-29"), by = "day")

## Create a time series object
set.seed(25)
myseries1 <- ts(ts.sales[,2],     
           start = c(2021, as.numeric(format(inds[1], "%j"))),
           frequency = 302)
###########################################
# Preliminary Analysis
###########################################
# Time Plot 
autoplot(myseries1) + ggtitle("Time Plot: Daily Gross Profits") +
  ylab("Thousands of Riyals")

# Data has some trend. Investigate transformations. 
#Take the first difference of the data to remove the trend.

DY <- diff(myseries1)

# Time plot of the differenced data 

autoplot(DY) + 
  ggtitle("Time Plot: Change in Daily Gross Profits") +
  ylab("Thousands of Riyals")

# Series appears stationary; use it to investigate seasonality.

# Not working here: the series does not contain a complete seasonal period,
# so the seasonal and subseries plots cannot be drawn
# ggseasonplot(DY) + 
#   ggtitle("Seasonal Plot: Change in Daily Gross Profits") +
#   ylab("Thousands of Riyals")


# subseries plot
#ggsubseriesplot(DY)


############################################
# ARMA MODEL
############################################

#This will plot the time series
ts.plot(myseries1, xlab="Year", ylab="Daily Gross Profits", main="Daily Gross Profits")

# Add a fitted regression line
abline(reg=lm(myseries1~time(myseries1)))

# Autocorrelation: the ACF describes how well the present value of the
# series is related to its past values
acf(myseries1)
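The ACF mainly suggests an MA order; the partial ACF is the usual companion for choosing the AR order, since it cuts off after lag p for an AR(p) process:

```r
# Partial autocorrelation: a sharp cutoff after lag p suggests AR(p)
pacf(myseries1)
```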

############################################
#Fitting the AR Model to the time series
############################################

AR <- arima(myseries1, order = c(1,0,0))
print(AR)
## 
## Call:
## arima(x = myseries1, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.5933  1030.6234
## s.e.  0.0464    64.3063
## 
## sigma^2 estimated as 211815:  log likelihood = -2318.27,  aic = 4642.55
#plotting the series along with the fitted values
ts.plot(myseries1)
AR_fit <- myseries1 - residuals(AR)
points(AR_fit, type = "l", col = 2, lty = 2)

#Forecasting using the AR model

#Using predict() to make a 1-step forecast
predict_AR <- predict(AR)

#Obtaining the 1-step forecast using $pred[1]
predict_AR$pred[1]
## [1] 1296.144
#Alternatively, using predict() to make 1-step through 10-step forecasts
predict(AR, n.ahead = 10)
## $pred
## Time Series:
## Start = c(2022, 6) 
## End = c(2022, 15) 
## Frequency = 302 
##  [1] 1296.144 1188.166 1124.099 1086.085 1063.531 1050.149 1042.208 1037.497
##  [9] 1034.702 1033.043
## 
## $se
## Time Series:
## Start = c(2022, 6) 
## End = c(2022, 15) 
## Frequency = 302 
##  [1] 460.2340 535.1484 559.1380 567.3420 570.2021 571.2055 571.5584 571.6826
##  [9] 571.7263 571.7417
#plotting the series plus the forecast and 95% prediction intervals
ts.plot(myseries1)
AR_forecast <- predict(AR, n.ahead = 10)$pred
AR_forecast_se <- predict(AR, n.ahead = 10)$se
points(AR_forecast, type = "l", col = 2)
points(AR_forecast - 2*AR_forecast_se, type = "l", col = 2, lty = 2)
points(AR_forecast + 2*AR_forecast_se, type = "l", col = 2, lty = 2)

############################################
#Fit the MA model 
############################################
#Fitting the MA model to the series
MA <- arima(myseries1, order = c(0,0,1))
print(MA)
## 
## Call:
## arima(x = myseries1, order = c(0, 0, 1))
## 
## Coefficients:
##          ma1  intercept
##       0.5851  1024.9978
## s.e.  0.0393    42.1326
## 
## sigma^2 estimated as 217433:  log likelihood = -2322.28,  aic = 4650.57
#plotting the series along with the MA fitted values
ts.plot(myseries1)
MA_fit <- myseries1 - resid(MA)
points(MA_fit, type = "l", col = 2, lty = 2)

#Forecasting using the MA model
#Making a 1-step forecast based on MA
predict_MA <- predict(MA)

#Obtaining the 1-step forecast using $pred[1]
predict_MA$pred[1]
## [1] 1286.585
#Alternatively, making a 1-step through 10-step forecast based on the MA model
predict(MA,n.ahead=10)
## $pred
## Time Series:
## Start = c(2022, 6) 
## End = c(2022, 15) 
## Frequency = 302 
##  [1] 1286.585 1024.998 1024.998 1024.998 1024.998 1024.998 1024.998 1024.998
##  [9] 1024.998 1024.998
## 
## $se
## Time Series:
## Start = c(2022, 6) 
## End = c(2022, 15) 
## Frequency = 302 
##  [1] 466.2969 540.2385 540.2385 540.2385 540.2385 540.2385 540.2385 540.2385
##  [9] 540.2385 540.2385
#Plotting the series plus the forecast and 95% prediction intervals
ts.plot(myseries1)
MA_forecasts <- predict(MA, n.ahead = 10)$pred
MA_forecast_se <- predict(MA, n.ahead = 10)$se
points(MA_forecasts, type = "l", col = 2)
points(MA_forecasts - 2*MA_forecast_se, type = "l", col = 2, lty = 2)
points(MA_forecasts + 2*MA_forecast_se, type = "l", col = 2, lty = 2)

#Choosing AR or MA: comparing the fits and their information criteria
# Find correlation between AR_fit and MA_fit
cor(AR_fit, MA_fit)
## [1] 0.8805198
# Find AIC of AR
AIC(AR)   #4642.548
## [1] 4642.548
# Find AIC of MA
AIC(MA) # 4650.568
## [1] 4650.568
# Find BIC of AR
BIC(AR)   #4653.728
## [1] 4653.728
# Find BIC of MA
BIC(MA)  #4661.749
## [1] 4661.749
#According to the lower AIC and BIC values, I will go with the AR model.
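The same comparison can be tabulated in one step; this sketch reuses the AR and MA fits from above:

```r
# Side-by-side information criteria; lower values indicate the better fit
data.frame(model = c("AR(1)", "MA(1)"),
           AIC   = c(AIC(AR), AIC(MA)),
           BIC   = c(BIC(AR), BIC(MA)))
```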


############################################
#AUTO ARIMA MODEL
############################################

## use auto.arima to choose ARIMA terms

fit.arima <- auto.arima(myseries1)
## Warning: The chosen seasonal unit root test encountered an error when testing for the first difference.
## From stl(): series is not periodic or has less than two periods
## 0 seasonal differences will be used. Consider using a different unit root test.
## forecast for next 100 time points
forecast_arima <- forecast(fit.arima, h = 100)

## plot it
plot(forecast_arima)

#Model evaluation
print(summary(fit.arima))
## Series: myseries1 
## ARIMA(5,1,2) 
## 
## Coefficients:
##          ar1      ar2      ar3      ar4      ar5      ma1     ma2
##       0.2977  -0.7274  -0.1919  -0.3678  -0.3828  -0.8115  0.5686
## s.e.  0.1240   0.0670   0.0808   0.0521   0.0809   0.1296  0.1086
## 
## sigma^2 estimated as 141620:  log likelihood=-2247.01
## AIC=4510.01   AICc=4510.5   BIC=4539.8
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -4.135021 371.3887 280.7149 -16.78163 39.05975 0.5019812
##                     ACF1
## Training set 0.005920115
checkresiduals(fit.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(5,1,2)
## Q* = 100.32, df = 54, p-value = 0.000132
## 
## Model df: 7.   Total lags used: 61
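The Ljung-Box statistic reported by checkresiduals() can also be computed directly with Box.test(); a small p-value, as seen above, means autocorrelation remains in the residuals and the model has not captured all the structure. The lag value here is an arbitrary illustrative choice:

```r
# Manual Ljung-Box test on the ARIMA residuals; a small p-value signals
# leftover autocorrelation
Box.test(residuals(fit.arima), lag = 10, type = "Ljung-Box")
```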
###########################################
#Fit ETS method
###########################################
fit_ets <- ets(myseries1) # Residual SD = 515.6988
## Warning in ets(myseries1): I can't handle data with frequency greater than 24.
## Seasonality will be ignored. Try stlf() if you need seasonal forecasts.
print(summary(fit_ets))
## ETS(M,Ad,N) 
## 
## Call:
##  ets(y = myseries1) 
## 
##   Smoothing parameters:
##     alpha = 0.8683 
##     beta  = 1e-04 
##     phi   = 0.98 
## 
##   Initial states:
##     l = 1352.1537 
##     b = 181.3039 
## 
##   sigma:  0.5145
## 
##      AIC     AICc      BIC 
## 5539.413 5539.693 5561.774 
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -32.73531 515.6362 388.3124 -21.53877 47.69954 0.6943896
##                      ACF1
## Training set -0.001126532
checkresiduals(fit_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,Ad,N)
## Q* = 479.4, df = 56, p-value < 2.2e-16
## 
## Model df: 5.   Total lags used: 61
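With both candidates fitted, their training-set errors can be compared directly. Note that forecast::accuracy() must be namespaced here, because yardstick::accuracy() masks it after tidymodels is attached (see the conflicts listed above):

```r
# Lower RMSE/MAE indicates the better in-sample fit (ARIMA vs. ETS)
rbind(ARIMA = forecast::accuracy(fit.arima),
      ETS   = forecast::accuracy(fit_ets))
```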
############################################
# Facebook Prophet
############################################

# Adjusting the data

sales.data <- select(sales.data, c(1,6))

names(sales.data) <- c('ds', 'y')

# Making a Forecast

fb.ts.s <- prophet(sales.data, yearly.seasonality=FALSE ,daily.seasonality=FALSE)

future.s <- make_future_dataframe(fb.ts.s , periods=365)

tail(future.s)
##             ds
## 667 2022-10-24
## 668 2022-10-25
## 669 2022-10-26
## 670 2022-10-27
## 671 2022-10-28
## 672 2022-10-29
# Forecasting

forecast.s <- predict(fb.ts.s, future.s)

tail(forecast.s[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
##             ds     yhat yhat_lower yhat_upper
## 667 2022-10-24 1676.799   861.9544   2403.852
## 668 2022-10-25 1719.657   934.6177   2586.065
## 669 2022-10-26 1805.081   991.6498   2599.370
## 670 2022-10-27 2159.184  1278.4996   2925.468
## 671 2022-10-28 2512.561  1682.1670   3297.642
## 672 2022-10-29 1878.252  1023.7380   2699.135
plot(fb.ts.s, forecast.s)

prophet_plot_components(fb.ts.s, forecast.s)
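Prophet also ships a built-in cross-validation routine that refits on rolling cutoffs; the initial/period/horizon values below are illustrative assumptions, not tuned choices:

```r
# Rolling-origin evaluation of the Prophet model fitted above
df.cv <- cross_validation(fb.ts.s, initial = 180, period = 30,
                          horizon = 30, units = 'days')
head(performance_metrics(df.cv))
```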

Visit Data

# Loading the data

visit.dat <- read.csv("~/Documents/GitHub/DS_Capstone_Cafe_Analysis/Hourlydata.csv")

Exploring some of the most important variables

The response variable

VI.plot <- ggplot(data=visit.dat, aes(x=ValueIn)) +
        geom_histogram(fill="black",bins = 30) +
        xlab("The number of visitors") +
        NULL
summary(visit.dat$ValueIn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   8.643  12.000 107.000

The rest of the variables

valueout.plot <- ggplot(data=visit.dat[!is.na(visit.dat$ValueOut),], aes(x=ValueOut)) +
        geom_histogram(fill="black",bins = 30) +
        xlab("number of visitors leaving the cafe")+
        NULL
summary(visit.dat$ValueOut)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   8.307  11.000 122.000
OT.plot <- ggplot(data=visit.dat, aes(x=OutsideTraffic)) +
        geom_histogram(fill="black",bins = 30) +
        xlab(" The number of people outside the cafe ") +
        NULL
summary(visit.dat$OutsideTraffic)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0    10.0   295.9    82.0  6562.0
TI.plot <- ggplot(data=visit.dat, aes(x=TurnInRate)) +
        geom_histogram(fill="black",bins = 30) +
        xlab("The rate of visitors turn in") +
        NULL
summary(visit.dat$TurnInRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   23.27   35.71  100.00
day.plot <- ggplot(visit.dat, aes(x = Day, y = ValueIn, fill = factor(Day))) +
  stat_summary(fun = mean, geom = "bar") +   # mean visitors per day of the week
  xlab("The days") +
  NULL
table(visit.dat$Day)
## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##      1153      1152      1152      1152      1176      1152      1152
ggarrange(VI.plot, valueout.plot, OT.plot, TI.plot, day.plot + rremove("x.text"), ncol = 3, nrow = 2)

I decided to remove the variable ValueOut, since it doesn't add information beyond the number of visitors entering.

# Removing the ValueOut variable

visit.dat <- select(visit.dat, -ValueOut)

Random split

set.seed(123)  # for reproducibility
visit.split  <- initial_split(visit.dat, prop = 0.7)
visit.train  <- training(visit.split)
visit.test   <- testing(visit.split)

Important features using Decision Trees

# create resampling procedure
set.seed(13)
kfold <- vfold_cv(visit.train, v = 5)

# Step 1: create decision tree model object
dt_mod <- decision_tree(mode = "regression") %>% set_engine("rpart")

# Step 2: create model recipe
model_recipe <- recipe(
    ValueIn ~., 
    data = visit.train
  )
  
# Step 3: fit model workflow
dt_fit <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(dt_mod) %>%
  fit(data = visit.train)

# Step 4: results
dt_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: decision_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 0 Recipe Steps
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## n= 5662 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 5662 1.191322e+06  8.410456000  
##    2) Time=0:00,1:00,10:00,11:00,12:00,13:00,14:00,15:00,16:00,2:00,3:00,4:00,5:00,6:00,7:00,8:00,9:00 4062 1.231171e+05  1.985229000  
##      4) TurnInRate< 0.175 2838 4.996829e+00  0.001057082 *
##      5) TurnInRate>=0.175 1224 8.603299e+04  6.585784000  
##       10) Date=2020-11-26,2020-11-29,2020-12-03,2020-12-05,2020-12-06,2020-12-08,2020-12-14,2020-12-15,2020-12-16,2020-12-21,2020-12-23,2020-12-29,2021-01-03,2021-01-05,2021-01-10,2021-01-12,2021-01-13,2021-01-19,2021-01-21,2021-01-24,2021-01-25,2021-01-31,2021-02-02,2021-02-03,2021-02-04,2021-02-05,2021-02-06,2021-02-07,2021-02-08,2021-02-10,2021-02-12,2021-02-13,2021-02-14,2021-02-15,2021-02-16,2021-02-17,2021-02-18,2021-02-19,2021-02-20,2021-02-21,2021-02-22,2021-02-23,2021-02-24,2021-02-25,2021-02-26,2021-02-27,2021-02-28,2021-03-01,2021-03-02,2021-03-03,2021-03-04,2021-03-05,2021-03-06,2021-03-07,2021-03-08,2021-03-09,2021-03-10,2021-03-11,2021-03-12,2021-03-13,2021-03-14,2021-03-15,2021-03-16,2021-03-17,2021-03-18,2021-03-19,2021-03-20,2021-03-21,2021-03-22,2021-03-23,2021-03-24,2021-03-25,2021-03-26,2021-03-27,2021-03-28,2021-03-29,2021-03-30,2021-03-31,2021-04-01,2021-04-02,2021-04-03,2021-04-04,2021-04-05,2021-04-06,2021-04-07,2021-04-08,2021-04-09,2021-04-11,2021-04-12,2021-04-22,2021-04-23,2021-04-25,2021-04-26,2021-04-27,2021-04-28,2021-04-29,2021-05-02,2021-05-03,2021-05-05,2021-05-06,2021-05-07,2021-05-08,2021-05-09,2021-05-10,2021-05-11,2021-05-13,2021-05-14,2021-05-15,2021-05-16,2021-05-17,2021-05-18,2021-05-19,2021-05-20,2021-05-21,2021-05-22,2021-05-23,2021-05-24,2021-05-25,2021-05-26,2021-05-27,2021-05-28,2021-05-29,2021-05-30,2021-05-31,2021-06-01,2021-06-02,2021-06-03,2021-06-04,2021-06-05,2021-06-06,2021-06-07,2021-06-08,2021-06-09,2021-06-10,2021-06-11,2021-06-12,2021-06-13,2021-06-14,2021-06-15,2021-06-16,2021-06-17,2021-06-18,2021-06-19,2021-06-20,2021-06-21,2021-06-22,2021-06-23,2021-06-24,2021-06-25,2021-06-26,2021-06-27,2021-06-28,2021-06-29,2021-06-30,2021-07-01,2021-07-02,2021-07-04,2021-07-05,2021-07-06,2021-07-07,2021-07-08,2021-07-09,2021-07-10,2021-07-11,2021-07-12,2021-07-13,2021-07-14,2021-07-15,2021-07-16,2021-07-17,2021-07-18,2021-07-19,2021-07-20,2021-07-21,2021-07-22,2021-07-23,2021-07-24,2021-07-25,2021-07-26,2021-07-27,20
21-07-28,2021-07-29,2021-07-30,2021-07-31,2021-08-01,2021-08-02,2021-08-03,2021-08-04,2021-08-05,2021-08-06,2021-08-07,2021-08-08,2021-08-09,2021-08-10,2021-08-11,2021-08-15,2021-08-16,2021-08-17,2021-08-18,2021-08-19,2021-08-20,2021-08-21,2021-08-22,2021-08-23,2021-08-24,2021-08-25,2021-08-26,2021-08-28,2021-08-29,2021-08-30,2021-08-31,2021-09-01,2021-09-02,2021-09-03,2021-09-05,2021-09-06,2021-09-07,2021-09-08,2021-09-09,2021-09-10,2021-09-11,2021-09-12,2021-09-13,2021-09-14,2021-09-15,2021-09-16,2021-09-18,2021-09-20,2021-09-21,2021-09-22,2021-09-23,2021-09-24,2021-09-25,2021-09-26,2021-09-27,2021-09-28,2021-09-29,2021-09-30,2021-10-01,2021-10-02,2021-10-03,2021-10-05,2021-10-06,2021-10-07,2021-10-08,2021-10-09,2021-10-10,2021-10-11,2021-10-12,2021-10-13,2021-10-14,2021-10-15,2021-10-17,2021-10-18,2021-10-20,2021-10-23,2021-10-24,2021-10-28 1136 2.486614e+04  5.040493000 *
##       11) Date=2020-11-27,2020-11-28,2020-11-30,2020-12-01,2020-12-02,2020-12-04,2020-12-07,2020-12-09,2020-12-10,2020-12-12,2020-12-17,2020-12-18,2020-12-19,2020-12-22,2020-12-24,2020-12-25,2020-12-27,2020-12-28,2020-12-31,2021-01-01,2021-01-02,2021-01-04,2021-01-06,2021-01-07,2021-01-08,2021-01-09,2021-01-11,2021-01-14,2021-01-15,2021-01-16,2021-01-17,2021-01-18,2021-01-20,2021-01-22,2021-01-23,2021-01-26,2021-01-27,2021-01-28,2021-01-29,2021-02-01,2021-04-24,2021-05-01,2021-05-04,2021-07-03,2021-08-14,2021-09-04,2021-10-16,2021-10-19,2021-10-21,2021-10-22,2021-10-25,2021-10-27 88 2.343590e+04 26.534090000 *
##    3) Time=17:00,18:00,19:00,20:00,21:00,22:00,23:00 1600 4.747788e+05 24.722500000  
##      6) Date=2020-11-26,2020-11-29,2020-11-30,2020-12-12,2020-12-13,2020-12-14,2020-12-15,2020-12-19,2020-12-20,2020-12-21,2020-12-22,2020-12-23,2020-12-27,2020-12-28,2020-12-29,2020-12-30,2021-01-03,2021-01-04,2021-01-11,2021-01-17,2021-01-21,2021-01-24,2021-01-25,2021-01-26,2021-01-27,2021-02-03,2021-02-05,2021-02-06,2021-02-07,2021-02-08,2021-02-09,2021-02-10,2021-02-11,2021-02-12,2021-02-13,2021-02-14,2021-02-15,2021-02-16,2021-02-17,2021-02-18,2021-02-19,2021-02-20,2021-02-21,2021-02-22,2021-02-23,2021-02-24,2021-02-25,2021-02-26,2021-02-27,2021-02-28,2021-03-01,2021-03-02,2021-03-03,2021-03-04,2021-03-05,2021-03-06,2021-03-07,2021-03-09,2021-03-11,2021-03-12,2021-03-13,2021-03-14,2021-03-15,2021-03-16,2021-03-17,2021-03-18,2021-03-20,2021-03-21,2021-03-24,2021-03-27,2021-03-28,2021-03-29,2021-03-31,2021-04-03,2021-04-04,2021-04-05,2021-04-07,2021-04-10,2021-04-11,2021-04-12,2021-04-13,2021-04-14,2021-04-15,2021-04-16,2021-04-17,2021-04-18,2021-04-19,2021-04-20,2021-04-21,2021-04-22,2021-04-23,2021-04-24,2021-04-25,2021-04-26,2021-04-27,2021-04-28,2021-04-29,2021-04-30,2021-05-01,2021-05-02,2021-05-03,2021-05-04,2021-05-05,2021-05-06,2021-05-07,2021-05-08,2021-05-09,2021-05-10,2021-05-11,2021-05-12,2021-05-13,2021-05-17,2021-05-19,2021-05-22,2021-05-23,2021-05-24,2021-05-25,2021-05-28,2021-05-30,2021-05-31,2021-06-01,2021-06-02,2021-06-05,2021-06-06,2021-06-07,2021-06-08,2021-06-12,2021-06-13,2021-06-15,2021-06-16,2021-06-19,2021-06-20,2021-06-22,2021-06-23,2021-06-26,2021-06-27,2021-06-28,2021-06-29,2021-07-04,2021-07-05,2021-07-06,2021-07-07,2021-07-10,2021-07-11,2021-07-12,2021-07-13,2021-07-14,2021-07-15,2021-07-16,2021-07-17,2021-07-18,2021-07-19,2021-07-20,2021-07-21,2021-07-24,2021-07-26,2021-07-27,2021-07-31,2021-08-01,2021-08-02,2021-08-03,2021-08-06,2021-08-07,2021-08-08,2021-08-09,2021-08-10,2021-08-15,2021-08-16,2021-08-17,2021-08-18,2021-08-22,2021-08-23,2021-08-24,2021-08-25,2021-08-28,2021-08-30,2021-08-31,2021-09-05,2021-09-06,2021-09-07,2021
-09-08,2021-09-11,2021-09-12,2021-09-13,2021-09-14,2021-09-15,2021-09-18,2021-09-19,2021-09-20,2021-09-21,2021-09-26,2021-09-27,2021-09-29,2021-10-02,2021-10-03,2021-10-04,2021-10-05,2021-10-06,2021-10-07,2021-10-10,2021-10-11,2021-10-12,2021-10-13,2021-10-17,2021-10-18,2021-10-19,2021-10-20,2021-10-23,2021-10-24,2021-10-25,2021-10-26 979 1.062612e+05 16.598570000  
##       12) Date=2020-12-15,2021-02-05,2021-02-06,2021-02-07,2021-02-08,2021-02-09,2021-02-10,2021-02-11,2021-02-12,2021-02-15,2021-02-16,2021-02-17,2021-02-18,2021-02-19,2021-02-20,2021-02-21,2021-02-22,2021-02-23,2021-02-24,2021-02-25,2021-02-26,2021-02-27,2021-02-28,2021-03-01,2021-03-02,2021-03-03,2021-03-04,2021-03-05,2021-03-06,2021-03-13,2021-03-14,2021-04-11,2021-04-12,2021-04-13,2021-04-14,2021-04-15,2021-04-17,2021-04-18,2021-04-19,2021-04-21,2021-04-23,2021-04-25,2021-04-27,2021-04-28,2021-04-30,2021-05-01,2021-05-02,2021-05-03,2021-05-04,2021-05-05,2021-05-06,2021-05-07,2021-05-08,2021-05-09,2021-05-10,2021-05-11,2021-05-12,2021-05-24,2021-06-02,2021-06-06,2021-07-04,2021-07-07,2021-07-11,2021-07-13,2021-07-18,2021-07-19,2021-07-21,2021-07-24,2021-08-08,2021-08-17,2021-08-30,2021-08-31,2021-09-20,2021-09-26,2021-10-03,2021-10-07,2021-10-10,2021-10-19 368 2.056854e+04  9.339674000 *
##       13) Date=2020-11-26,2020-11-29,2020-11-30,2020-12-12,2020-12-13,2020-12-14,2020-12-19,2020-12-20,2020-12-21,2020-12-22,2020-12-23,2020-12-27,2020-12-28,2020-12-29,2020-12-30,2021-01-03,2021-01-04,2021-01-11,2021-01-17,2021-01-21,2021-01-24,2021-01-25,2021-01-26,2021-01-27,2021-02-03,2021-02-13,2021-02-14,2021-03-07,2021-03-09,2021-03-11,2021-03-12,2021-03-15,2021-03-16,2021-03-17,2021-03-18,2021-03-20,2021-03-21,2021-03-24,2021-03-27,2021-03-28,2021-03-29,2021-03-31,2021-04-03,2021-04-04,2021-04-05,2021-04-07,2021-04-10,2021-04-16,2021-04-20,2021-04-22,2021-04-24,2021-04-26,2021-04-29,2021-05-13,2021-05-17,2021-05-19,2021-05-22,2021-05-23,2021-05-25,2021-05-28,2021-05-30,2021-05-31,2021-06-01,2021-06-05,2021-06-07,2021-06-08,2021-06-12,2021-06-13,2021-06-15,2021-06-16,2021-06-19,2021-06-20,2021-06-22,2021-06-23,2021-06-26,2021-06-27,2021-06-28,2021-06-29,2021-07-05,2021-07-06,2021-07-10,2021-07-12,2021-07-14,2021-07-15,2021-07-16,2021-07-17,2021-07-20,2021-07-26,2021-07-27,2021-07-31,2021-08-01,2021-08-02,2021-08-03,2021-08-06,2021-08-07,2021-08-09,2021-08-10,2021-08-15,2021-08-16,2021-08-18,2021-08-22,2021-08-23,2021-08-24,2021-08-25,2021-08-28,2021-09-05,2021-09-06,2021-09-07,2021-09-08,2021-09-11,2021-09-12,2021-09-13,2021-09-14,2021-09-15,2021-09-18,2021-09-19,2021-09-21,2021-09-27,2021-09-29,2021-10-02,2021-10-04,2021-10-05,2021-10-06,2021-10-11,2021-10-12,2021-10-13,2021-10-17,2021-10-18,2021-10-20,2021-10-23,2021-10-24,2021-10-25,2021-10-26 611 5.462347e+04 20.970540000 *
##      7) Date=2020-11-27,2020-11-28,2020-12-01,2020-12-02,2020-12-03,2020-12-04,2020-12-05,2020-12-06,2020-12-07,2020-12-08,2020-12-09,2020-12-10,2020-12-11,2020-12-16,2020-12-17,2020-12-18,2020-12-24,2020-12-25,2020-12-26,2020-12-31,2021-01-01,2021-01-02,2021-01-05,2021-01-06,2021-01-07,2021-01-08,2021-01-09,2021-01-10,2021-01-12,2021-01-13,2021-01-14,2021-01-15,2021-01-16,2021-01-18,2021-01-19,2021-01-20,2021-01-22,2021-01-23,2021-01-28,2021-01-29,2021-01-30,2021-01-31,2021-02-01,2021-02-02,2021-02-04,2021-03-08,2021-03-10,2021-03-19,2021-03-22,2021-03-23,2021-03-25,2021-03-26,2021-03-30,2021-04-01,2021-04-02,2021-04-06,2021-04-08,2021-04-09,2021-05-14,2021-05-15,2021-05-16,2021-05-18,2021-05-20,2021-05-21,2021-05-26,2021-05-27,2021-05-29,2021-06-03,2021-06-04,2021-06-09,2021-06-10,2021-06-11,2021-06-14,2021-06-17,2021-06-18,2021-06-21,2021-06-24,2021-06-25,2021-06-30,2021-07-01,2021-07-02,2021-07-03,2021-07-08,2021-07-09,2021-07-22,2021-07-23,2021-07-25,2021-07-28,2021-07-29,2021-07-30,2021-08-04,2021-08-05,2021-08-11,2021-08-12,2021-08-13,2021-08-14,2021-08-19,2021-08-20,2021-08-21,2021-08-26,2021-08-27,2021-08-29,2021-09-01,2021-09-02,2021-09-03,2021-09-04,2021-09-09,2021-09-10,2021-09-16,2021-09-17,2021-09-22,2021-09-23,2021-09-24,2021-09-25,2021-09-28,2021-09-30,2021-10-01,2021-10-08,2021-10-09,2021-10-14,2021-10-15,2021-10-16,2021-10-21,2021-10-22,2021-10-27,2021-10-28 621 2.020447e+05 37.529790000  
##       14) Time=17:00,18:00,19:00 256 5.171096e+04 28.011720000  
##         28) Date=2020-12-06,2020-12-09,2020-12-16,2021-01-05,2021-01-10,2021-01-12,2021-01-13,2021-01-18,2021-01-28,2021-02-01,2021-03-08,2021-03-10,2021-03-23,2021-03-25,2021-03-26,2021-03-30,2021-04-01,2021-04-08,2021-04-09,2021-05-14,2021-05-15,2021-05-16,2021-05-18,2021-05-20,2021-05-21,2021-05-26,2021-05-27,2021-05-29,2021-06-03,2021-06-04,2021-06-11,2021-06-14,2021-06-17,2021-06-21,2021-06-24,2021-06-25,2021-06-30,2021-07-01,2021-07-02,2021-07-03,2021-07-08,2021-07-22,2021-07-23,2021-07-25,2021-07-28,2021-07-29,2021-07-30,2021-08-04,2021-08-05,2021-08-11,2021-08-12,2021-08-19,2021-08-20,2021-08-26,2021-08-29,2021-09-01,2021-09-02,2021-09-09,2021-09-22,2021-09-30,2021-10-01,2021-10-14,2021-10-21,2021-10-27,2021-10-28 131 8.283740e+03 18.587790000 *
##         29) Date=2020-11-27,2020-11-28,2020-12-01,2020-12-02,2020-12-03,2020-12-05,2020-12-07,2020-12-08,2020-12-10,2020-12-11,2020-12-17,2020-12-18,2020-12-24,2020-12-25,2020-12-26,2020-12-31,2021-01-01,2021-01-02,2021-01-06,2021-01-07,2021-01-08,2021-01-09,2021-01-14,2021-01-15,2021-01-19,2021-01-20,2021-01-22,2021-01-23,2021-01-29,2021-01-30,2021-01-31,2021-02-02,2021-02-04,2021-03-19,2021-03-22,2021-04-02,2021-06-10,2021-06-18,2021-07-09,2021-08-13,2021-08-14,2021-08-21,2021-08-27,2021-09-03,2021-09-04,2021-09-10,2021-09-16,2021-09-17,2021-09-23,2021-09-24,2021-09-25,2021-09-28,2021-10-08,2021-10-09,2021-10-15,2021-10-16,2021-10-22 125 1.960043e+04 37.888000000 *
##       15) Time=20:00,21:00,22:00,23:00 365 1.108756e+05 44.205480000  
##         30) Date=2020-11-27,2020-11-28,2020-12-01,2020-12-02,2020-12-03,2020-12-04,2020-12-05,2020-12-06,2020-12-07,2020-12-09,2020-12-16,2020-12-17,2020-12-18,2020-12-24,2020-12-25,2020-12-26,2020-12-31,2021-01-02,2021-01-09,2021-01-10,2021-01-12,2021-01-16,2021-01-18,2021-01-19,2021-01-22,2021-01-23,2021-01-30,2021-01-31,2021-02-01,2021-02-02,2021-02-04,2021-03-10,2021-03-19,2021-03-22,2021-03-25,2021-03-26,2021-03-30,2021-04-01,2021-04-06,2021-05-14,2021-05-16,2021-05-18,2021-05-20,2021-05-29,2021-06-03,2021-06-09,2021-06-14,2021-06-18,2021-06-21,2021-06-30,2021-07-02,2021-07-03,2021-07-08,2021-07-25,2021-07-28,2021-08-04,2021-08-05,2021-08-11,2021-08-14,2021-08-21,2021-08-27,2021-08-29,2021-09-01,2021-09-04,2021-09-25,2021-09-28,2021-10-09,2021-10-16,2021-10-22,2021-10-27 211 3.011762e+04 35.232230000 *
##         31) Date=2020-12-08,2020-12-10,2020-12-11,2021-01-01,2021-01-05,2021-01-06,2021-01-07,2021-01-08,2021-01-13,2021-01-14,2021-01-15,2021-01-20,2021-01-28,2021-01-29,2021-03-08,2021-03-23,2021-04-02,2021-04-08,2021-04-09,2021-05-15,2021-05-21,2021-05-26,2021-05-27,2021-06-04,2021-06-10,2021-06-11,2021-06-17,2021-06-24,2021-06-25,2021-07-01,2021-07-09,2021-07-22,2021-07-23,2021-07-29,2021-07-30,2021-08-12,2021-08-13,2021-08-19,2021-08-20,2021-08-26,2021-09-02,2021-09-03,2021-09-09,2021-09-10,2021-09-16,2021-09-17,2021-09-22,2021-09-23,2021-09-24,2021-09-30,2021-10-01,2021-10-08,2021-10-14,2021-10-15,2021-10-21,2021-10-28 154 4.049050e+04 56.500000000 *
# train model
results <- fit_resamples(dt_mod, model_recipe, kfold)

# model results
collect_metrics(results)
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   8.21      5  0.210  Preprocessor1_Model1
## 2 rsq     standard   0.683     5  0.0135 Preprocessor1_Model1
dt_fit %>%
  extract_fit_parsnip() %>%  # pull_workflow_fit() was deprecated in workflows 0.2.3
  vip(20)

Time and Date are the most important variables.

The effect of the hour on the number of visitors

Time_vig <- plot_ly(visit.dat, y = ~ValueIn, type = 'bar', color = ~Time) %>%
  layout(barmode = 'stack')
Time_vig
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
  • We can see that there are many customers at night, from 7 pm until midnight, with the highest numbers of visitors at 10 pm and 11 pm.
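As a rough illustration of this kind of hourly summary, here is a minimal sketch with hypothetical toy data (the real analysis uses the visit.dat columns of the same names):

```r
# Toy stand-in for visit.dat: two observations per hour (hypothetical numbers)
visits <- data.frame(
  Time    = rep(c("19:00", "22:00", "23:00"), each = 2),
  ValueIn = c(5, 7, 20, 22, 18, 24)
)
# Mean visitors per hour, sorted so the busiest hours come first
hourly_means <- aggregate(ValueIn ~ Time, data = visits, FUN = mean)
hourly_means[order(-hourly_means$ValueIn), ]
```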

The effect of the day on the number of visitors

ggplot(visit.dat, aes(x = factor(Day), y = ValueIn, fill = factor(Day))) +
  geom_col(position = "dodge") +
  NULL

  • Weekends have the highest number of visitors.
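The same idea works by weekday; a toy sketch with made-up daily totals, where `format(..., "%u")` gives the ISO weekday number (1 = Monday, 7 = Sunday):

```r
# Hypothetical daily totals for one Monday-to-Sunday week
days <- data.frame(
  Day     = as.Date("2021-10-18") + 0:6,  # 2021-10-18 is a Monday
  ValueIn = c(40, 42, 45, 50, 80, 95, 60)
)
days$weekday <- format(days$Day, "%u")
# Total visitors per weekday; 5-7 (Fri-Sun) dominate in this toy example
aggregate(ValueIn ~ weekday, data = days, FUN = sum)
```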

Time Series Analysis

###################################
# Time Series For visitor count
###################################


# Taking out the observations from 2020-11-26 until 2020-11-30, since a few days won't show a monthly effect for November

visit.dat <- slice(visit.dat,-c(1:120))

# Using aggregate() function to prepare dataset for plotting and time series analysis

visit.dat <- aggregate(ValueIn ~ Day + Time , visit.dat, mean)
                       
## Create a daily Date object - helps when working with dates
inds <- seq(as.Date("2020-12-01"), as.Date("2021-10-29"), by = "day")

## Create a time series object
set.seed(25)
v.series <- ts(visit.dat[,3],     
                start = c(2020, as.numeric(format(inds[1], "%j"))))



###########################################
# Preliminary Analysis
###########################################
# Time Plot 
autoplot(v.series) + ggtitle("Time Plot: Daily visitors") 

autoplot(v.series) +
  ggtitle("Daily visitors") +
  xlab("(visiting month)") + ylab("(visitors Records)") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

# Data has some trend. Investigate transformations. 
#Take the first difference of the data to remove the trend.

DY <- diff(v.series)

# Time plot of the differenced data 

autoplot(DY) + 
  ggtitle("Time Plot: Change in Daily visitors") +
  ylab("number of visitors")

# Series appears stationary; we can use it to investigate seasonality.
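As a quick sanity check on why first differencing removes a trend, here is a minimal toy example (not the cafe data):

```r
# A series with a pure linear trend: slope 2, intercept 5
trend_series <- ts(2 * (1:10) + 5)
# diff() takes first differences, which reduces the linear trend to a constant
diff(trend_series)  # every value is 2 once the trend is removed
```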

# ggseasonplot() and ggsubseriesplot() require a ts with frequency > 1;
# v.series was created with frequency = 1, so these calls error out.
# ggseasonplot(DY) +
#   ggtitle("Seasonal Plot: Change in Daily number of visitors")


# subseries plot
# ggsubseriesplot(DY) # Data are not seasonal


############################################
# ARMA MODEL
############################################

#This will plot the time series
ts.plot(v.series, xlab="month", ylab="Daily visitors", main="Daily visitors")

# This will fit in a line
abline(reg=lm(v.series~time(v.series)))

#Auto correlation
acf(v.series)  # the ACF describes how strongly the present value of the series relates to its past values

############################################
#Fitting the AR Model to the time series
############################################

AR <- arima(v.series, order = c(1,0,0))
print(AR)
## 
## Call:
## arima(x = v.series, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.7592     8.3312
## s.e.  0.0495     2.4213
## 
## sigma^2 estimated as 59.21:  log likelihood = -581.62,  aic = 1169.24
#plotting the series along with the fitted values
ts.plot(v.series)
AR_fit <- v.series - residuals(AR)
points(AR_fit, type = "l", col = 2, lty = 2)

#Forecasting using the AR model

#Using predict() to make a 1-step forecast
predict_AR <- predict(AR)

#Obtaining the 1-step forecast using $pred[1]
predict_AR$pred[1]
## [1] 2.607299
#Alternatively, using predict() to make 1-step through 10-step forecasts
predict(AR, n.ahead = 10)
## $pred
## Time Series:
## Start = 2523 
## End = 2532 
## Frequency = 1 
##  [1] 2.607299 3.985703 5.032168 5.826630 6.429775 6.887674 7.235305 7.499222
##  [9] 7.699583 7.851695
## 
## $se
## Time Series:
## Start = 2523 
## End = 2532 
## Frequency = 1 
##  [1]  7.694655  9.660896 10.630213 11.150680 11.439907 11.603332 11.696487
##  [8] 11.749843 11.780485 11.798110
#plotting the series plus the forecast and 95% prediction intervals
ts.plot(v.series)
AR_pred <- predict(AR, n.ahead = 10)  # one call instead of two identical ones
AR_forecast <- AR_pred$pred
AR_forecast_se <- AR_pred$se
points(AR_forecast, type = "l", col = 2)
points(AR_forecast - 2*AR_forecast_se, type = "l", col = 2, lty = 2)
points(AR_forecast + 2*AR_forecast_se, type = "l", col = 2, lty = 2)

############################################
#Fit the MA model 
############################################
#Fitting the MA model to the visitor series
MA <- arima(v.series, order = c(0,0,1))
print(MA)
## 
## Call:
## arima(x = v.series, order = c(0, 0, 1))
## 
## Coefficients:
##          ma1  intercept
##       0.4743     8.5748
## s.e.  0.0521     1.1082
## 
## sigma^2 estimated as 95.29:  log likelihood = -621.29,  aic = 1248.58
#plotting the series along with the MA fitted values
ts.plot(v.series)
MA_fit <- v.series - resid(MA)
points(MA_fit, type = "l", col = 2, lty = 2)

#Forecasting using the MA model
#Making a 1-step forecast based on MA
predict_MA <- predict(MA)

#Obtaining the 1-step forecast using $pred[1]
predict_MA$pred[1]
## [1] 5.978835
#Alternatively, making a 1-step through 10-step forecast based on MA
predict(MA,n.ahead=10)
## $pred
## Time Series:
## Start = 2523 
## End = 2532 
## Frequency = 1 
##  [1] 5.978835 8.574797 8.574797 8.574797 8.574797 8.574797 8.574797 8.574797
##  [9] 8.574797 8.574797
## 
## $se
## Time Series:
## Start = 2523 
## End = 2532 
## Frequency = 1 
##  [1]  9.761522 10.803969 10.803969 10.803969 10.803969 10.803969 10.803969
##  [8] 10.803969 10.803969 10.803969
#Plotting the visitor series plus the forecast and 95% prediction intervals
ts.plot(v.series)
MA_pred <- predict(MA, n.ahead = 10)  # one call instead of two identical ones
MA_forecasts <- MA_pred$pred
MA_forecast_se <- MA_pred$se
points(MA_forecasts, type = "l", col = 2)
points(MA_forecasts - 2*MA_forecast_se, type = "l", col = 2, lty = 2)
points(MA_forecasts + 2*MA_forecast_se, type = "l", col = 2, lty = 2)

#Choosing AR or MA: comparing the fits via correlation, AIC, and BIC
# Find correlation between AR_fit and MA_fit
cor(AR_fit, MA_fit)
## [1] 0.9269574
# Find AIC of AR
AIC(AR)   #1169.239
## [1] 1169.239
# Find AIC of MA
AIC(MA) # 1248.577
## [1] 1248.577
# Find BIC of AR
BIC(AR)   #1178.611
## [1] 1178.611
# Find BIC of MA
BIC(MA)  #1257.949
## [1] 1257.949
#According to the lowest values of the AIC and BIC, I will go with the AR model.
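The AIC/BIC comparison above follows the usual pattern: the model class that matches the data-generating process tends to win. A small simulated sketch (toy data, not the visitor series) makes this concrete:

```r
set.seed(1)
# Simulate 300 points from an AR(1) process with coefficient 0.7
sim <- arima.sim(model = list(ar = 0.7), n = 300)
# Fit both candidate model classes, exactly as done for v.series above
ar_fit <- arima(sim, order = c(1, 0, 0))
ma_fit <- arima(sim, order = c(0, 0, 1))
# The correctly specified AR(1) fit should have the lower AIC
AIC(ar_fit) < AIC(ma_fit)
```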


############################################
#An ARIMA MODEL
############################################

## use auto.arima to choose ARIMA terms
fit.arima <- auto.arima(v.series, d = 1, D = 1, stepwise = FALSE, approximation = FALSE, trace = TRUE)
# d = 1 forces the first difference; D = 1 is ignored here because the series is non-seasonal (frequency = 1)
# trace = TRUE prints every candidate model; Residual SD = 6.472248
## 
##  ARIMA(0,1,0)                    : 1179.261
##  ARIMA(0,1,0) with drift         : 1181.309
##  ARIMA(0,1,1)                    : 1122.093
##  ARIMA(0,1,1) with drift         : 1124.167
##  ARIMA(0,1,2)                    : 1121.729
##  ARIMA(0,1,2) with drift         : 1123.828
##  ARIMA(0,1,3)                    : 1123.683
##  ARIMA(0,1,3) with drift         : 1125.808
##  ARIMA(0,1,4)                    : 1119.371
##  ARIMA(0,1,4) with drift         : 1121.523
##  ARIMA(0,1,5)                    : 1106.666
##  ARIMA(0,1,5) with drift         : 1108.845
##  ARIMA(1,1,0)                    : 1127.438
##  ARIMA(1,1,0) with drift         : 1129.512
##  ARIMA(1,1,1)                    : 1121.405
##  ARIMA(1,1,1) with drift         : 1123.505
##  ARIMA(1,1,2)                    : 1122.731
##  ARIMA(1,1,2) with drift         : 1124.857
##  ARIMA(1,1,3)                    : 1124.714
##  ARIMA(1,1,3) with drift         : 1126.867
##  ARIMA(1,1,4)                    : 1107.301
##  ARIMA(1,1,4) with drift         : 1109.48
##  ARIMA(2,1,0)                    : 1125.1
##  ARIMA(2,1,0) with drift         : 1127.199
##  ARIMA(2,1,1)                    : 1123.235
##  ARIMA(2,1,1) with drift         : 1125.361
##  ARIMA(2,1,2)                    : 1124.805
##  ARIMA(2,1,2) with drift         : 1126.957
##  ARIMA(2,1,3)                    : 1126.901
##  ARIMA(2,1,3) with drift         : 1129.081
##  ARIMA(3,1,0)                    : 1121.027
##  ARIMA(3,1,0) with drift         : 1123.153
##  ARIMA(3,1,1)                    : 1123.139
##  ARIMA(3,1,1) with drift         : 1125.291
##  ARIMA(3,1,2)                    : 1107.077
##  ARIMA(3,1,2) with drift         : 1109.257
##  ARIMA(4,1,0)                    : 1123.125
##  ARIMA(4,1,0) with drift         : 1125.277
##  ARIMA(4,1,1)                    : 1125.305
##  ARIMA(4,1,1) with drift         : 1127.484
##  ARIMA(5,1,0)                    : 1123.481
##  ARIMA(5,1,0) with drift         : 1125.66
## 
## 
## 
##  Best model: ARIMA(0,1,5)
#Model evaluation
print(summary(fit.arima))
## Series: v.series 
## ARIMA(0,1,5) 
## 
## Coefficients:
##           ma1     ma2      ma3     ma4      ma5
##       -0.5187  0.0246  -0.2964  0.5532  -0.2723
## s.e.   0.0752  0.0719   0.0649  0.0793   0.0643
## 
## sigma^2 estimated as 41.89:  log likelihood=-547.07
## AIC=1106.14   AICc=1106.67   BIC=1124.85
## 
## Training set error measures:
##                        ME     RMSE      MAE MPE MAPE      MASE        ACF1
## Training set -0.004934802 6.355837 3.185637 NaN  Inf 0.7677304 -0.03316768
checkresiduals(fit.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,5)
## Q* = 30.16, df = 5, p-value = 1.372e-05
## 
## Model df: 5.   Total lags used: 10
## forecast the next 100 time points
forecast_arima <- forecast(fit.arima, h = 100)

## plot it
plot(forecast_arima)

###########################################
#Fit ETS method
###########################################
fit_ets <- ets(v.series) # Residual SD = 6.849643
print(summary(fit_ets))
## ETS(A,N,N) 
## 
## Call:
##  ets(y = v.series) 
## 
##   Smoothing parameters:
##     alpha = 0.4156 
## 
##   Initial states:
##     l = 1.4505 
## 
##   sigma:  6.8908
## 
##      AIC     AICc      BIC 
## 1513.356 1513.502 1522.728 
## 
## Training set error measures:
##                        ME     RMSE     MAE  MPE MAPE      MASE        ACF1
## Training set -0.007053437 6.849643 3.48906 -Inf  Inf 0.8408546 -0.07499169
checkresiduals(fit_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,N)
## Q* = 54.674, df = 8, p-value = 5.109e-09
## 
## Model df: 2.   Total lags used: 10
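Putting the two training-set RMSEs printed in the summaries side by side confirms that the ARIMA(0,1,5) fit is the better of the two; a one-liner sketch using those reported numbers:

```r
# Training RMSEs reported above for the two fitted models
rmse <- c(arima = 6.355837, ets = 6.849643)
# The model with the smaller training RMSE
names(which.min(rmse))  # "arima"
```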
############################################
# Facebook Prophet
#not working well on this data
############################################

# Adjusting the data

visit.dat <- read.csv("~/Documents/GitHub/DS_Capstone_Cafe_Analysis/Hourlydata.csv")

# Taking out the observations from 2020-11-26 until 2020-11-30, since a few days won't show a monthly effect for November

visit.dat <- slice(visit.dat,-c(1:120))

visit.dat <- select(visit.dat, c(1,3))

names(visit.dat) <- c('ds', 'y')

# Making a Forecast

fb.ts <- prophet(visit.dat, yearly.seasonality=FALSE ,daily.seasonality=FALSE)

future <- make_future_dataframe(fb.ts , periods=365)

tail(future)
##             ds
## 693 2022-10-24
## 694 2022-10-25
## 695 2022-10-26
## 696 2022-10-27
## 697 2022-10-28
## 698 2022-10-29
# Forecasting

forecast <- predict(fb.ts, future)

tail(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
##             ds      yhat yhat_lower yhat_upper
## 693 2022-10-24  8.268133 -15.222656   32.10870
## 694 2022-10-25  8.625258 -14.634970   31.54612
## 695 2022-10-26  9.360516 -13.319796   32.98290
## 696 2022-10-27 12.472258 -10.028295   36.34948
## 697 2022-10-28 13.801397  -8.205279   38.34382
## 698 2022-10-29  9.756179 -14.117433   32.67423
plot(fb.ts, forecast)

prophet_plot_components(fb.ts, forecast)

Conclusion

To increase the number of visitors, and thus profits, I suggest focusing on weekends, since they attract the most customers. Removing unprofitable products from the menu and increasing the stock of profitable ones may also help.

Outlook for future development

In the future, I plan to deploy the model and feed it new data once the cafe completes a full year from its opening date.

Limitations & problems

The main problem I faced is that, as noted above, there is no seasonality in the data; I believe this is due to the COVID-19 lockdown and the limit on how many customers were allowed inside the cafe. The main limitation was the data itself, since it covers a period of less than a year.